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, /^ l Abstract 

\^ Current computational methods for exon-intron structure prediction 

f\ from a cluster of transcript (EST, mRNA) data do not exhibit the time 

• ^H and space efficiency necessary to process large clusters of over than 20,000 

1-^ ESTs and genes longer than 1Mb. Guaranteeing both accuracy and efh- 

^yi ciency seems to be a computational goal quite far to be achieved, since 

I I accuracy is strictly related to exploiting the inherent redundancy of infor- 

mation present in a large cluster. 

We propose a fast method for the problem that combines two ideas: 
a novel algorithm of proved small time complexity for computing spliced 
alignments of a transcript against a genome, and an efficient algorithm 
that exploits the inherent redundancy of information in a cluster of tran- 
scripts to select, among all possible factorizations of EST sequences, those 
allowing to infer splice site junctions that are highly confirmed by the in- 
put data. The EST alignment procedure is based on the construction of 
maximal embeddings that are sequences obtained from paths of a graph 
structure, called Embedding Graph, whose vertices are the maxim,al pair- 
. . ings of a genomic sequence T and an EST P. The procedure runs in time 

^ linear in the size of P, T and of the output. 

k>( PIntron, the software tool implementing our methodology, is able to 

%^ process in a few seconds some critical genes that are not manageable by 

C^ other gene structure prediction tools. At the same time, PIntron exhibits 

high accuracy (sensitivity and specificity) when compared with ENCODE 
data. 

Detailed experimental data, additional results and PIntron software 
are available at http://www.algolab.eu/PIntron 
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1 Introduction 

A key step in the post-transcriptional modification process is called splicing 
and consists in the excision of the intronic regions of the premature mRNA 
(pre-mRNA) while the exonic regions are then reconnected to re-form a single 
continuous molecule, the mature mRNA. A complex regulatory system medi- 
ates the splicing process which, under different conditions, may produce alter- 
native mature mRNAs (also called transcript isoforms) starting from a single 
pre-mRNA molecule. Alternative Splicing (AS), i.e. the production of alterna- 
tive transcripts from the same gene, is the main mechanism responsible for the 
expansion of the transcriptome (the set of transcripts generated by the genome 
of one organism) in eukaryotes and it is also involved in the onset of several 
diseases [7]. 

A great extent of work has been performed to solve two basic problems 
on AS: characterizing the exon-intron structure of a gene and finding the set 
of different transcript isoforms that are produced from the same gene. Some 
computational approaches to these crucial problems have been proposed; indeed 
good implementations are available [TH [T71 [5T1 [THl [51 [E] . In this paper we focus 
on providing an algorithm - efficient from both a theoretical and an empirical 
point of view - to predict the exon-intron structure: a problem that is still 
unsolved by methods based on transcript data. 

Indeed, it must be pointed out that few efforts have been done in the di- 
rection of providing a formal framework to design efficient algorithms for the 
general AS prediction problem. As a matter of fact, existing tools are not able 
to process efficiently genes that are huge or with a very large set of associated 
clusters of ESTs ;8j. A basic reason of this fact is that combinatorial meth- 
ods for the problem must combine two different steps: (1) to produce putative 
spliced alignments of ESTs against the gene region and (2) to use redundancy 
and the whole covering of the gene region provided by a cluster in the prediction 
process by selecting among putative spliced alignments of multiple ESTs [2] the 
ones that confirm the same gene structure. The two steps are much harder 
when combined. In fact the literature provides efficient solutions of the first 
step when solved independently from the second one. Finding an alignment of 
an EST sequence could be a hard task when more than one alignment can exist 
for the same input data and a primary goal is the choice of the alignment that 
integrates biological meaningfulness w.r.t. the whole gene structure. On the 
other side the second step is NP-hard [J thus requiring efficient heuristics. 

In this paper we show how to efficiently solve the integration of the above two 
steps. First we design a new fast algorithm for producing splicing alignments of 
EST sequences by exploiting a new combinatorial formulation of the problem. 
Afterwards, given the spliced alignments for each EST sequence of a gene, for 
each EST we extract a composition that is consistent with a putative exon- 
intron structure of the given gene, by applying the Minimum Factorization 
Agreement (MFA) approach to the data produced by our spliced alignment 
algorithm [3j. Indeed, the MFA approach provides an effective method to extract 
the compositions in such a way that the whole gene structure is derived (as an 



EST sequence provides the information on a partial region of the whole gene) . 

2 System and Methods 

Our new combinatorial method for exon-intron structure prediction can be sum- 
marized as a four-stage pipeline where we: (1) compute and implicitly represent 
of all the spliced alignments of a transcript sequence (EST or mRNA) against 
a genomic reference sequence; (2) filter all biologically meaningful spliced align- 
ments; (3) reconcile the spliced alignments of a set of correlated transcript 
sequences into a consensus gene structure; (4) extract, classify and refine the 
resulting introns. In the following we describe the main idea on which the first 
two steps of our pipeline are built. 

A basic ingredient of most computational approaches for gene structure pre- 
diction is aligning several ESTs against the reference genomic sequence [HI [121 
[2H [ini HOI [S] taking into account the effects of the excision of the intronic re- 
gions. Thus, when considering ESTs, a particular kind of alignment problem 
arises: the spliced sequence alignment. The spliced sequence alignment problem 
requires to compute, given a sequence S and a reference sequence T, two sets 
Fs = {/i, ■ • ■ , fk} and Ft = {/{, ■ ■ ■ J'k} of strings such that S = /i • • • /fc, 
T = pf'iii ■ ■ -ik-ifi-S, and for each i, the edit distance between /^ and f- is 
small. The sequence of pairs (/i, f^) is called composition of S on T, each fac- 
tor fi is called spliced sequence factor (or EST factor), and each f- is called 
genomic factor (or exon). Clearly, in the biological context, the sequence S is 
an EST or a mRNA sequence, while the reference sequence T is the genomic 
sequence of the locus of the gene where the EST comes from. Also allowing a 
small edit distance between two factors is justified by the fact that EST data 
contain mismatches (deletions and insertions) against the genome because of 
sequencing errors and polymorphisms. 

Notice that allowing an approximate alignment between factors makes the 
spliced alignment problem computationally harder, especially when EST data 
and the genomic sequence are large. Moreover, multiple compositions can exist 
for the same input data and thus integrating biological meaningfulness is a 
primary goal. 

One of the main ideas of our approach is that the small edit distance between 
a sequence factor and a genomic factor implies that there exist some perfectly 
conserved subfactors (called pairings) between those factors. The problem of 
combining together such pairings is simplified by the introduction of a new 
data structure called Embedding Graph which is a graph representing all spliced 
alignments. Building and querying the Embedding Graph is not trivial and is 
the main subject of our algorithmic investigation discussed in the next section. 

The whole four-step pipeline method has been implemented and experi- 
mented. We designed our experimental analysis in two parts, according to the 
two goals that we wanted to achieve, namely to investigate the scalability of 
our implementation and to assess the quality of the results obtained. For each 
gene the associated human Unigene cluster has been processed by our spliced 



alignment algorithm, together with the whole ENCODE region. Indeed, we 
have assessed the quality of the results by comparing the outputs computed by 
our implementation with the gene structure data provided by ENCODE which 
aimed at providing a reliable annotation of 1% human genome (TH]- 

For the assessment of the prediction accuracy, we use two quality measures, 
sensitivity and specificity, commonly adopted for the evaluation of computa- 
tional gene-structure prediction tools |11) . We call benchmark set B and test 
set I respectively the data retrieved from ENCODE and the results computed 
by our approach. Sensitivity is defined as the ratio \B D I\/\B\), while speci- 
ficitjF] is defined as the ratio \B H /|/|/|. Notice that both ratios have values 
between and 1, and that higher values correspond to higher similarity between 
B and I. We will analyze the results produced by our pipeline according to two 
coordinates: the set of predicted introns and the set of predicted splice sites. 
When considering the set of predicted introns, an intron e E B belongs to BnJ 
if there exists a prediction i G I, perfectly matching e both on the donor and 
the acceptor splice sites on the genomic sequence. 

The first experiment has been run on a set of 112 fairly typical genes from 
13 ENCODE regions with lengths ranging from 0.5Mb to 1.7Mb encompassing 
98,064 transcripts (total length 63Mb), on which we have studied both the 
running times and the quality of the results. The second experiment, devoted 
to studying the scalability of our approach, has been performed on some hand- 
picked genes exhibiting a large genomic sequence or a large set of transcripts. 
More precisely, we have chosen 26 genes of which 11 are at least 1Mb long (on 
average about 848Kb), and 5 have more than 15,000 transcripts (on average 
more than 5000 transcripts). The complete list of regions and genes studied 
in both experiments, as well as the data supporting our analysis, is in the 
supplementary material. 

3 Methods 

3.1 Implicit Computation of Spliced Alignments 

The first stage of our gene structure prediction method computes the set of 
all possible spliced alignments of an EST (or mRNA) sequence against the 
genomic sequence. In particular, the first stage computes an implicit compact 
representation of the spliced alignments, which is then used by the second stage 
to extract all biologically plausible alignments. 

In our novel alignment method, we exploit a fundamental property of the 
notion of composition: the edit distance between each pair of corresponding 
factors is small. Therefore, there must exist some sufficiently long common 
substrings of the EST factors and the genomic factors. For example, a typical 
low-quality sequencing technology produces a 50bp-long exon with a 6% error 
rate. The worst scenario still allows for pairs of llbp-long substrings that per- 



^We adopted the definition of specificity used when evaluating gene-structure prediction 
tools, even if not standard [IJ 



fectly match. Clearly, if the sequence of perfectly matching pairs of substrings 
is known, it is quite easy to reconstruct a possible alignment of the factors. We 
call the sequence of the occurrences of perfectly matching pairs an embedding 
of the EST sequence in the genomic sequence. Notice that there might be more 
than one embedding of the EST sequence P in the genomic sequence T. We can 
compute efhciently all embeddings maintaining a graph whose vertices are all 
possible maximal pairings of P and T. A maximal pairing of P and T general- 
izes the notion of maximal pair of a sequence [13j and it provides the occurrence 
on P and T of a maximal common substring of P and T. The vertex set V can 
be computed in time linear in the sizes of P, T and V. Edges of the graph are 
computed in time at most 0(|yp). 

According to the traditional notation, given a string S — siS2 ■ • ■ Sq, we 
denote with \S\ its length and with 5'[i, j] the substring s^Si+i • • • Sj. A funda- 
mental notion is that of pairing of two strings. Given a pattern P and a text T, 
a pairing of P and T is a triplet (p, t, I) such that P[p,p+l — l] = T[t,t+l — l\. In 
other words, a pairing (p, t, I) represents a common substring x (or factor) of P 
and T of length I starting in positions p and t on P and T respectively. We call 
p and t the starting position on P and T respectively, p + 1 and t + I the ending 
position on P and T respectively, while I is the length of the pairing. When no 
ambiguities arise, the two strings P and T will be omitted when dealing with a 
pairing. 

Let / be the common factor represented by a pairing v — {p,t,l). Notice 
that all triplets {p+Si,t + 6i,l — 62), with < i5i < / and Si < S2 < l~Si are also 
pairings. It is natural to define an order < among pairings. Let vi — {pi,ti,li) 
and V2 = {P2,t2,l2) be two pairings, then vi :< V2 iS P2 < Pi < Pi+h < P2 + I2 
and Pi — P2 = ti ~ t2- When vi ^ V2 we say that Vi is a sub-pairing of V2, or 
V2 contains vi. Moreover, vi is a prefix-pairing [suffix-pairing, resp.) of V2 iff 
V\ ■< V2 and the starting positions (the ending positions, resp.) of vi and V2 on 
P and T are equal. 

Based on the order relation ^ we can define the concept of maximality of 
pairings. A pairing v is maximal if and only if it is neither a suffix-pairing nor 
a prefix-pairing of a distinct pairing v' . In other words, v = (p,t,l) is maximal 
if and only if P[p - 1] 7^ T[t - 1] and P[p + I] ^ T[t + I]. 

We recall that an embedding is a sequence of pairings. Therefore we can 
extend ^ to an order between embeddings and, based on this, we derive the 
notion of maximal embeddings. Given two embeddings e = (wi, • • • , u„) and 
e' = {v[,- ■ ■ ,v'^), then e is contained in e' (in short e :< e') if and only if for 
each Vi in e there exists a pairing u' in e' such that Vi ^ w' . Given the set £ of 
the embeddings of P in T, we say that e G £ is maximal iff there does not exist 
e' £ £, e j^ e', such that e :< e' . 

Not all embeddings induce a biologically meaningful composition. For ex- 
ample, an embedding constituted by several short pairings "scattered" along 
the genome cannot be considered a valid spliced alignment. A representa- 
tive embedding is a maximal embedding e — {vi, . . . ,Vm) such that U > Ie 
and pi+i - Pi - k < in, and either (i) |tj+i ~ t^ - p^+i + pi\ < Id or [ii) 
ti+i — ti — pi^i +pi > £1 is true - only representative embeddings might induce 



a biologically plausible composition. Intuitively, the parameter (.e is the mini- 
mum length of a pairing (in order to avoid the previous example), to regulates 
the maximum number of consecutive mismatches, and £/ represents the min- 
imum length of a valid intron. For each composition such that every f^j-long 
substring of the EST sequence is aligned with the corresponding genomic factors 
with up to j^ errors, there exists (at least) one representative embedding in- 
ducing such a composition (if we choose carefully the values of the parameters). 
This fact roughly means that spliced alignments with error rate at most — can 
be recovered from some representative embeddings. 

Therefore we propose the problem of finding all the representative embed- 
dings of P in T, formalized as the Representative Embedding Problem (RE), 
where we are given a pattern P, a text T, and three parameters (.e, (-d and (.j. 
The goal is to compute the set £r of the representative embeddings of P in T. 
We are now able to introduce the embedding graph, which is our main device 
for tackling the RE problem. 

Definition 3.1 (Embedding Graph). Given a pattern P and a text T, the 
embedding graph of P in T is a directed graph G = (V, E) such that the vertex- 
set V is composed by the set of maximal pairings of P and T longer than £ei 
and two pairings vi = (pi,ii,Zi) and V2 = {P2,t2,h) are connected by an edge 
(ui,W2) if and only if: (i) P2 - Pi - h < ^d, and (ii) 1^2 - h - P2 + Pi\ < ^d or 
h - ti - p2 + pi > ij. 

Basically the above conditions ensure that if two maximal pairings vi and 
V2 are connected by an edge in the embedding graph, then there exists a repre- 
sentative embedding e in which a sub-pairing of vi and a sub-pairing of V2 are 
consecutive. It is possible to prove that also the converse proposition is true: 
that is, if two pairings v[ and V2 are consecutive in a representative embedding 
e, then the maximal pairings vi and V2 which contain v[ and V2, respectively, are 
connected by an edge in the embedding graph. The previous properties derive 
from the maximality of the representative embeddings and from the uniqueness 
of the maximal pairing containing a pairing which belongs to a representative 
embedding. 

We designed an algorithm which builds the embedding graph of a pattern 
P and a text T in time 0{\T\ + \P\ + \V\'^). The algorithm is composed by two 
steps: In the first step, the vertex-set V is computed by visiting the suffix-tree 
of the text T. This step requires 0(|r|) time for the suffix-tree construction and 
0(|P| + |V^|) time for the computation of maximal pairings. In the second step. 



edges are then computed by checking the conditions of Definition 3.1 on each 
pair of maximal pairings, leading to a 0(|yp) procedure. Since the number of 
maximal pairings is usually very small compared to the length of P and T, the 
embedding graph construction procedure is efficient even on large patterns P 
and texts T. 
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Figure 1: Possible configurations of relative positions of two maximal pairings 
V, v' connected by an embedding graph edge. Each box represents a common 
maximal factor on T (top) and P (bottom) of a maximal pairing. Then v, v' 
are represented by two boxes connected by dotted lines. Four possible cases are 
presented: (a) v, v' overlap on both T and P, (b) v, v' overlap on T but not on 
P, (c) v,v' overlap on P but not on T, and (d) v,v' do not overlap neither on 
T nor on P. 



3.2 Extraction of Relevant Spliced Alignments 

The next stage of our pipeline is devoted to analyzing and mining the embed- 
ding graph to compute the representative embeddings that also induce distinct 
biologically meaningful compositions. Given two pairings that are connected by 
an edge in the embedding graph, the corresponding factors might be overlapping 
in the text or in the pattern, leading to four different configurations that are 
depicted in Fig. [T] 

Algorithm ComputeCompositions is a two-step procedure. Initially it ex- 
tracts a subset of representative embeddings by performing a visit of the cm- 
bedding graph. Then the algorithm computes the compositions by merging 
consecutive pairings that are separated by short gaps. Basic biological criteria 
(such as the recognition of canonical splice sites) are locally used, if possible, to 
resolve ambiguities in the exon-intron boundary determination. These bound- 
aries can be changed during the last stage of the pipeline according to some 
refinement criteria which globally analyze the set of introns induced by the 
spliced alignments of all the transcripts. 



Embedding graph visit. The first step of ComputeCompositions is a recur- 
sive visit of the embedding graph starting from a subset of vertices that we call 
extended sources. The visit of a vertex Vk from the extended source s recon- 
structs the set £ of biologically meaningful representative embeddings that are 
induced by the path 'P = (s, ui, . . . , Vk) traversed during the visit. Due to space 
constraints we are unable to provide the formal characterization of extended 
sources, as well as some technical details of the visit and a proof of the correct- 
ness of the step - each representative embedding has been examined, only the 
biologically meaningful ones have been added to set £, and the visits compute 
pairwise-distinct representative embeddings. 

During the visit of vertex Vk, we examine each outgoing edge (wfc,Wfc+i) 
and we "extend" each embedding e = (ui, . . . ,Uk) oi £. How the extension is 
performed depends on the overlapping of the involved factors, i.e. it depends on 
the four possible configurations of relative positions between the last pairing Uk 
of e and the new vertex Vk+i that are depicted in Fig. [T] In the exposition of 
the four cases, let Uk = (pk^tkjk) and Vk+i = {pk+i,tk+i,lk+i)- 

Case (a). Factors Uk and Vk+i overlap on both T and P. Two different sub- 
cases must be analyzed: If \{tk+i — tk) — {pk+i —Pk)\ < ^d, then the two pairings 
may belong to the same factor of the induced composition. Thus, the algorithm 
replaces pairing Uk in e with the shortest maximal prefix-pairing u'f, of Uk and 
the longest maximal sufhx-pairing Uk+i of w^+i, such that both u'/^ and Uk+i do 
not overlap and have lengths at least as large as £e- The second case is when 
(i/c+i — tk) — (pk+i —Pk) > ^i- In such case the two pairings might be separated 
by an intron. Thus, some basic biological criteria are used to compute two 
maximal sub-pairings u'^. of Uk and Uk+i of Vk+i that could represent a suffix 
and a prefix (respectively) of an exon. We replace Uk with u'j, and we extend 
the embedding with Uk+i- Whenever more than one such a pairing exists, the 
embedding e is extended into a set of embeddings, one for each pairing found. 

Case (b). Factors Uk and Vk+i overlap on T but not on P. This case is 
equivalent to the first sub-case of Case (a). 

Case (c). Factors Uk and Vk+i overlap on P but not on T. This case is similar 
to the entire Case (a). Notice that when the second subcase is relevant, that 
is (i/c+i — tk) — {Pk+i — Pk) > (-1, then the splice site placement is ambiguous 
because a suffix of the donor exon is equal to a prefix of the acceptor exon. 
Also in this case, basic biological criteria are used to reduce the impact of the 
ambiguity. 

Case (d). Factors Uk and Vk+i do not overlap neither on P nor on T. Let 
Gt and Gp be the two substrings which separate Uk and Vk+i on T and P, 
respectively. Since Gp and Gt do not form a pairing, they must contain a 
certain number of mismatches; we must evaluate if they support the hypothesis 
that (i) Uk and Vk+i are part of the same factor or (ii) there is an intron between 



Uk and Vk+i- Similarly to Case (a), two different sub-cases may arise: If \{tk+i — 
tk) — {pk+i ^ Pk)\ < ^_D, then Uk and Vk+i might belong to the same factor of 
the induced composition. More precisely, Uk and Vk+i belong to the same factor 
if the edit distance between Gt and Gp is below a certain threshold - in which 
case Vk+i is added to embedding e, otherwise the edge is discarded from the 
visit. Instead, if tk+i — tk —pk+i +Pk > ^i, the two pairings are separated by an 
intron, and we must determine the splice sites of such an intron. In this case, the 
algorithm computes a prefix Gip and a suffix G'^ of Gt, that minimize the edit 
distance between Gp and the concatenation of G^ and G^. Also in this case, 
if the resulting edit distance is larger than an acceptable threshold, the edge 
{vk, ffc+i) is discarded, otherwise Vk+i is added to e. Notice that computing the 
edit distance is not too expensive, since all strings involved are no longer than 

2iD. 



The definition of embedding graph (Def. 3.1 ) allows the presence of directed 
cycles, which potentially might be troublesome. However, cycles are of no bio- 
logical interest, as no actual embeddings can be associated to any cycle, therefore 
we simply ignore any edge ending in an already visited vertex. In fact, we claim 
that the embeddings, computed from a path V containing a cycle C, would in- 
duce (in the following step of this stage) compositions with essentially the same 
set of factors of the compositions induced by the embeddings computed from 
the visit of the simple path V\C. We support our claim with a simple example 
on paths containing a directed cycle of length 2, that is the presence of a pair 
(t^i, 112) and (w2, fi) of consecutive edges. Let V = {vi,V2,vi,V3) be such a path 
and, for simplicity, suppose that the relative position of vi and W3 is the one 
depicted in the second sub-case of Case (d) . Observe that the relative position 
of Vi and V2 must be the one of Case (a), whatever direction is considered, since 
a gap between the two pairings (on P or on T) would imply the absence of one 
of the parallel edges. Moreover, only the first sub-case of Case (a) may arise. 
After the visit of path V, the set £ of embeddings will contain an embedding 
e = {v'i,V2,Vi,V3) where v'l is a prefix-pairing of wi, V2 is contained in V2, and 
v'{ is a suffix-pairing of vi. As explained in the first sub-case of Case (a), the 
following step of composition reconstruction will "merge" pairings v[, Wj, and 
Vi in a single factor (exon) since they are not separated by a gap large enough 
to represent a plausible intron. Such a factor will be approximately equal to the 
factor represented by pairing vi because v[ and v" are, respectively, a prefix- 
pairing and a suffix-pairing of vi. Instead, the visit of simple path V' — (ui, W3) 
computes the embedding e' = (viyV^). Clearly, also in this case, one of the 
factors computed in the following step of composition reconstruction will be 
(approximately) the factor represented by pairing vi, concluding the example. 

The visit performed in the first step of algorithm ComputeCompositions 
guarantees that each possible representative embedding is analyzed. However, 
the biological criteria that we employ allow to consider only pairings belong- 
ing to biologically meaningful embeddings. Since the visit computes pairwise- 
distinct representative embeddings and every case presented above requires 0(1) 
time, the overall computational complexity of the visit is clearly bounded by 
0{J2ee£ I^D' ^^^^ ^^ ^^"^ total size of the representative embeddings that have 



been computed during the visit. 

Composition reconstruction. The set 8 of representative embeddings com- 
puted by the visit of the embedding graph directly leads to a set C of compo- 
sitions. In fact, the visit guarantees that two consecutive pairings of a rep- 
resentative embedding are either separated by a "small" gap due to errors 
or by a "large" gap representing an intron of the spliced alignments. Hence, 
the algorithm simply merges into a factor a sequence of consecutive pairings 
Vk = {Pk,tk,lk) and Vk+i = {pk+i,tk+ijk+i) separated by "small" gaps, that 
is \tk+i — tk — Pk+i + Pk\ < £d- Finally, the composition is retained if the edit 
distance between each EST factor and the corresponding genomic factor is not 
greater than a fixed acceptable threshold. 

3.3 Building a Gene Structure 

In the third stage of our pipeline, we compute a maximum-parsimony con- 
sensus gene-structure starting from the compositions of a cluster of transcript 
sequences against a common genomic sequence. Let T be the genomic sequence 
of a given gene locus and let S — {Pi, . . . ,Pk} be the cluster of transcript se- 
quences that map to the gene locus. The first two stages of our pipeline have 
considered separately each transcript sequence Pi and, for each of them, a set 
C{Pi) of biologically meaningful compositions has been computed. The main 
task is to extract a composition for each transcript that explains the putative 
gene structure. The redundancy of information, due to compositions of different 
transcripts in the cluster, is an important ingredient to discover a single com- 
position of each transcript that agrees with the gene structure. To achieve this 
goal we apply the Minimum Factorization Agreement (MFA) problem [3]. 

Let us recall the definition of the MFA problem. Let S" be a set of sequences 
over a finite alphabet S of symbols, and let F = (/i, /2, . . . , /|f|) be a finite 
ordered set of sequences over alphabet S, called factors. Given a sequence 
s £ S, a, factor- composition [f- composition in short) of s consists of the sequence 
{IiiJi2i • • • > /»„) such that s = /i^, /i2, • • • , /,„ and ij < ij+i for 1 < j < n - 1. 
While the notion of f-composition depends on the set of factors, such set of 
factors is usually clear from the context and is therefore omitted in the definition. 
Please notice that a sequence s can admit different f-compositions: thus let 
F{s) be the set of compositions of s. Moreover, by extension, we will denote by 
F{S) = UsgsF(s) the set of f-compositions of a set S of sequences. 

Given a f-composition / = {fi^, fi^, ■ ■ ■ , fi„ ) , the set {/ji , /ia , • • ■ , fi„ } is 
called the factor set of / and is denoted as F{f). Given a subset F' C F oi 
factors and the set F(S), then F' is a factorization agreement set for F(S) if 
and only if for each sequence s € S, there exists a f-composition / in F{s) such 
that its factor set is a subset of F', i.e. F{f) C F'. 

The Minimum Factorization Agreement problem, given a set F of factors 
and a set S of sequences, asks for a minimum cardinality subset F' C F such 
that F' is a factorization agreement set for F{S). Informally, the solution to the 
MFA problem is a smallest set of factors that is able to explain a f-composition 



10 



(a) Intron i 

G GACCCTGAAGCAGTAGTATCCCCAGGTA TAGGTGGTGTCGGGTTTGACATCAACTG 

Si GACCCTGAAGCAGTAGTATCCCCAG GTGGTGTCGGGTTTGACATCAACTG 

(b) Intron i' 

G GACCCTGAAGCAGTAGTATCCCCAGGTA TAGGTGGTGTCGGGTTTGACATCAACTG 

32 GACCCTGAAGCAGTAGTATCCCCAGG GGTGTCGGGTTTGACATCAACTG 

(c) Intron i 

G GACCCTGAAGCAGTAGTATCCCCAGGTA TAGGTGGTGTCGGGTTTGACATCAACTG 

32 GACCCTGAAGCAGTAGTATCCCCAG G-GGTGTCGGGTTTGACATCAACTG 



Figure 2: An example of intron reduction 



for each input sequence. In our setting S is the cluster of transcript sequences 
and F is the set of all exons (factors) used to produce the compositions of 
sequences in S, i.e. F{S) consists of all the compositions of S. When solving 
the MFA problem on these input data, the solution F' provides a minimum set 
of factors that can explain all transcript sequences and a single composition of 
each transcript can be obtained from set F' . 

However, in order to apply the MFA problem to our data we need to define a 
binary relation between factors (exons). Indeed, the first and the last factors of 
different transcript sequences could be fragments of the same exon. Moreover, 
since internal factors of transcript compositions are computed without applying 
refined biological criteria for the location of splice junctions, even internal fac- 
tors of different transcript compositions could be associated to the same exon 
whenever they differs by few bases. For this reason we define a binary relation 
between factors of compositions. More precisely, we say that two factors / and 
/' are ^-related if the two factors share a common overlapping regions of length 
at least N, for N a fixed bound. Factors that are ~-related are grouped into 
a region: each composition c of a transcript sequence s is then replaced by a 
composition into regions corresponding to each factor (recall that regions are 
disjoint). Similarly the set F consists of regions and thus the MFA problem 
applied to the new set F{S) produces a minimum set F' of regions that could 
explain all input compositions. 

By applying the algorithm proposed by [3] we are able to filter efficiently a 
set of spliced alignments agreeing to the same gene structure that are further 
refined by the next step of intron reduction. 

3.4 Intron Reduction 

Although the intron boundaries of EST spliced compositions are computed by 
finding the best transcript-genome alignment over the splice site regions and 
the most frequent intron pattern (i.e. the first and the last two nucleotides of 
an intron) according to [^ , the set of predicted introns may still contain false 
positives very close to true predictions. This can be explained by the example 
in Figure [2J Let us suppose that there exists an EST si (see Figure l2ja)) 
producing a canonical intron i - i.e. satisfying the so-called GT — AG rule [B] 
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- and that si perfectly matches to the genomic sequence over the intron sphce 
sites. Moreover, let us suppose that there exists another EST S2 (see Figure 
[2]jb)) with a deletion of a base 'T' with respect to si (highlighted in boldface in 
Figure l2{a)). Because of our spliced alignment algorithm, S2 produces a non- 
canonical intron i' (indeed its pattern is TA—GT), which differs from i by only a 
few bases. We can reasonably assume that i' is a false positive and not a reliable 
prediction. Hence, the spliced composition of S2 can be corrected into the spliced 
alignment supporting the intron i (see Figure l2rc)), by introducing just one 
insertion error into the genomic sequence. Thus, a procedure for comparing the 
intron set computed by the EST spliced compositions in order to correct and 
reduce the set of false positives (i.e. over predicted introns) is necessary. 

In the following, let the pair {i, s) denotes a genomic intron (eventually 
specified by a pair of genomic coordinates) and a spliced composition of an EST 
s supporting the intron «, i.e. having two consecutive factors /i, fi+i inducing 
intron i when aligned to the genome. Then, given an error bound b, we say that 
{i, s) is 6-reducible to (i', s) iff there exists a boundary shift of factors fi and 
/i+i of a new spliced composition of s inducing intron i' with at most additional 
b errors w.r.t. the previous alignment of the two factors against the genome. 
Since RefSeq transcripts are usually full-length and error-free, and GT — AG, 
GG — AG and AT — AC rules are the most frequent |B] and are associated to 
U12/U2 introns [IH], we assume that only introns, that do not follow one of the 
U12/U2 rules and are not supported by a RefSeq transcript should be reduced. 
The input of our intron-reduction procedure is a set X of pairs (i, s) computed 
by the previous steps. Then, R is the set of pairs in X such that s is a RefSeq, 
Ci, C2, C3 and A^ are the set of pairs in X \ i? following the GT-AG, GG-AG, 
AT — AC and a non-U12/U2 rule respectively. Our procedure basically tries to 
reduce elements in N to some intron in R (i.e. the most reliable class composed 
by RefSeq-supported introns) and if this is not possible tries to reduce to some 
element in the first set of the sequence Ci, C2 , C3 that allows the reduction. 



4 Implementation 

We implemented the pipeline as a set of C programs in the software package 
PIntron. The input is a genomic sequence G and a set T of EST and niRNA 
sequences, referred in the following with the more general term of transcripts. 
Our method computes a set C of exon compositions of the sequences in S with 
respect to the genomic sequence, such that there exists one exon composition 
per input transcript. In fact, for each transcript, our software retains only 
those compositions that can possibly provide information about the exon-intron 
structure. Our implementation is still at a preliminary stage; in fact only a few 
basic biological criteria have been introduced, such as constraints on the intron 
lengths as well as on the size of removable transcript prefixes and suffixes. The 
main goal of our study - and of our current implementation - is to assess the 
efficiency of the pipeline while maintaining a reasonable quality of the results 
produced. Together with the exon compositions, PIntron outputs the positions. 
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Figure 3: Distribution of ENCODE introns (left) and splicing sites (right) con- 
firmed by PIntron. For each region we have represented the mean of sensitivity 
and specificity values, together with the interval ranging from the mean minus 
the standard deviation and the mean plus the standard deviation. 



on the genomic sequence, of the donor and the acceptor splice sites of the introns, 
as well as several additional informations included the intron type (U12, U2 or 



unclassified) 



5 Discussion 



The results of the experimental analysis are very positive, especially the running 
times, as the analysis of several genes has required only a few hours, even 
using modest computational resources. After a first experiment, where we have 
analyzed a set of 112 fairly typical genes from 13 ENCODE regions with lengths 
ranging from 0.5Mb to 1.7Mb encompassing 98,064 transcripts (with total length 
63Mb) , we have decided to investigate the scalability of our implementation by 
analyzing some hand-picked genes exhibiting a large genomic sequence or a large 
set of transcripts. Comprehensive data regarding the experiments are reported 
in the supplementary material. 

In the first experiment, for each gene the associated human Unigene cluster 
(including also RefSeq mRNAs) has been processed by our spliced alignment 
algorithm, together with the whole ENCODE region. We tested our software 
on an off-the-shelf PC with a total running time of Ih 17sec (on average 41 
seconds/gene) . 

On the whole set of the 112 input genes, 1787 out of 1957 benchmark introns 
are also in the test set with an average support of 214.45 transcripts per intron 
(not including outlier genes RPLIO and EEFlAl that have 6,414 and 49,936 
transcripts respectively). We have further investigated the 170 ENCODE in- 
trons that have not been predicted by our method, searching in the ENSEMBL 
Genome Browser the supporting transcripts of each unpredicted intron. The in- 
vestigation pointed out that 95 introns (55.88% over the total of missing data) 
have no evidence in the ENSEMBL database, hence they are probably hand- 
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Figure 4: Distribution of ENCODE introns (left) and splicing sites (right) con- 
firmed by PIntron. Each gene is represented by a point whose coordinates are 
its specificity and its sensitivity. 

annotated introns. Of the remaining 75 unpredicted introns, 41 (24.12%) have 
no supporting transcripts in the Unigene set fed to our implementation. 18 
unpredicted introns (10.59%) have some supporting transcripts, but with only 
low-quality alignments between each transcript and the unpredicted intron. For 
most such transcripts our method computes a high-quality alignment support- 
ing another ENCODE intron that is very close to the unpredicted one. Finally, 
for 16 (9.41%) unpredicted introns our method is unable to align the transcripts 
supporting such introns (in one case we found errors in the Unigene DB). The 
overall sensitivity of our method is 0.9813, computed after discarding from the 
benchmark set the 136 ENCODE introns that have no evidence in ENSEMBL 
DB or have no supporting transcripts in the Unigene cluster given in input to 
PIntron. The benchmark set contains no intron for F8A1, H2AFB1 and IER5L. 
Our method confirms ENCODE for the first two genes, while it predicts 3 in- 
trons for IER5L. Notice that all genes (except OPNIMW) have sensitivity at 
least 0.8 and almost 75% have sensitivity 1. 

Our experiment has predicted 1042 introns that are not in the benchmark. 
The overall specificity of our method is 0.6317 and 10 genes have specificity 
equal to 1. Moreover, the average number of supporting transcripts for each 
new intron is 2.42. 268 new predictions have one of the two splice sites in 
the benchmark set B, while 92 new introns have both splice sites in B (but 
they belong to different ENCODE introns). Sensitivity and specificity for this 
analysis are represented in Figures [3] and |4] 

We have also studied if the introns predicted by PIntron are close to those 
predicted by ENCODE, relaxing the definition of introns common to the bench- 
mark and test sets by allowing splice sites that differ for up to 8 positions. We 
omit the results as no significant improvement has resulted. 

Since our pipeline seems to overpredict with respect to ENCODE, we have 
started a detailed analysis to understand the causes of such overpredictions. 
Almost all introns predicted by PIntron and ENCODE (1778 out of 1787) are 
canonical, classified as U12/U2 and contain a Branch Point Sequence (BPS). 
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Since we have not completed our analysis yet, we are currently unable to deter- 
mine if being canonical, classified or containing a BPS should be a requirement 
for a putative intron to be retained in the results output by PIntron. For ex- 
ample, introducing the criterion that only canonical and classified are output, 
the number of new predictions decreases from 1042 to 565: in such case the 
sensitivity becomes 0.9764 and the specificity becomes 0.7598. 

Other comparisons among gene-structure prediction tools in literature 0] are 
performed over different datasets, therefore that analysis and ours are not com- 
pletely comparable. Anyway, for illustrative purposes we report that two com- 
monly used software packages for predicting the exon-intron structure, namely 
ECgene [TB] and ASPic |Sj, have sensitivity/specificity that is respectively 0.92/0.63 
and 0.88/0.77 [1]: not as good as those obtained by our pipeline. Another pa- 
rameter that is surely interesting is the number of EST supporting a exon/intron. 

The time performance of PIntron over the experimented ENCODE genes is 
detailed in the supplementary material. In this section we will show the results 
only for three genes that are critical because of the size of the input transcripts 
or because of the length of the input genome (i.e. the corresponding ENCODE 
region). We will report only the time for computing spliced compositions, since 
it is the most time-consuming part of our pipeline. More precisely, gene TIMP3 
has taken 48 seconds for computing spliced compositions of 1,700 transcripts 
against the ENCODE region ENm004 that is 1,700,000 bp long. Gene RPLIO 
(in region ENm006) has taken 141 sec for aligning 6,414 transcripts against a 
genome of 1,338, 44 7bp, and finally the time, for computing the spliced composi- 
tions of the 49,936 transcripts of gene EEFlAl against the 500,000bp of region 
ENr223, has been 416 sec. 

To determine the scalability of our approach, we have chosen 26 genes, of 
which 11 are at least 1Mb long (on average about 848Kb), and 5 have more 
than 15,000 transcripts (on average more than 5000 transcripts). We tested 
our software on a commercial workstation equipped with 12GB of RAM with a 
total running time of 20 min (average 47 seconds/gene). Notice that processing 
TTN gene has taken 545 seconds (while other tools, such as the aforementioned 
ECgene, do not provide predictions for this gene). The likely reason is that the 
input set contains transcripts (ESTs and mRNAs) that are more than 80Kb 
long, and the spliced alignment computation time is very high for long EST 
sequences since their quality is lower than the quality of mRNA sequences. 

An analysis on the running times from both experiments has not shown any 
significant correlation between the length of the genes and the running times, 
hence confirming our conjecture that the behaviour of our algorithm depends 
on some properties of the embedding graph, and not on the size of the instance. 
In particular, the structure of the embedding graph is strictly related to the 
quality of the transcripts and to the presence in the gene of repetitions, highly 
duplicated regions or other elements that could influence the size of the graph. 
The results have confirmed our beliefs, since the average running time of the 
second experiment (47 seconds/gene), albeit on a faster PC, is not too far from 
the running times on the much smaller genes of the first experiment, where the 
average value is 41 seconds/gene. The experimental results for all investigated 
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genes are available at http://www.algolab.eu/PIntron 
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